Here, I will attempt to showcase the different demographic groups of users that share similar behaviour on the Starbucks mobile app. I will be using the following methods:
- FMT segmentation (frequency, monetary value, tenure) - group users into quantiles for each of these metrics, score customers using all 3 quantile features, then create segments based on the total scores
- K-means clustering - group users into clusters based on demographic and FMT features
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as subplots
sns.set()
import gc
import joblib
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans
# Data
profile = pd.read_pickle('../data/output1/profile.pkl')
transcript = pd.read_pickle('../data/output1/transcript.pkl')
offers = pd.read_pickle('../data/output1/offers.pkl')
profile.shape, transcript.shape, offers.shape
((14825, 10), (272388, 7), (148431, 26))
print(profile.info())
profile.head()
<class 'pandas.core.frame.DataFrame'> Int64Index: 14825 entries, 0 to 14824 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 14825 non-null object 1 age 14825 non-null int64 2 income 14825 non-null int32 3 user_id 14825 non-null int64 4 user_date 14825 non-null datetime64[ns] 5 signup_year 14825 non-null int64 6 age_group 14825 non-null int32 7 income_group 14825 non-null int32 8 amount 14825 non-null float64 9 spend_group 14825 non-null int32 dtypes: datetime64[ns](1), float64(1), int32(4), int64(3), object(1) memory usage: 1.0+ MB None
| gender | age | income | user_id | user_date | signup_year | age_group | income_group | amount | spend_group | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 55 | 112000 | 1 | 2017-07-15 | 2017 | 3 | 5 | 77.01 | 5 |
| 1 | F | 75 | 100000 | 2 | 2017-05-09 | 2017 | 5 | 5 | 159.27 | 8 |
| 2 | M | 68 | 70000 | 3 | 2018-04-26 | 2018 | 4 | 3 | 57.73 | 4 |
| 3 | M | 65 | 53000 | 4 | 2018-02-09 | 2018 | 4 | 2 | 36.43 | 3 |
| 4 | M | 58 | 51000 | 5 | 2017-11-11 | 2017 | 3 | 2 | 15.62 | 2 |
print(transcript.info())
transcript.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 272388 entries, 0 to 272387 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 272388 non-null int64 1 event 272388 non-null object 2 time 272388 non-null int64 3 amount 272388 non-null float64 4 offer_id 272388 non-null int32 5 reward 272388 non-null int32 6 user_id 272388 non-null int64 dtypes: float64(1), int32(2), int64(3), object(1) memory usage: 12.5+ MB None
| index | event | time | amount | offer_id | reward | user_id | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 0.0 | 7 | 0 | 2 |
| 1 | 2 | offer received | 0 | 0.0 | 5 | 0 | 3 |
| 2 | 5 | offer received | 0 | 0.0 | 8 | 0 | 4 |
| 3 | 7 | offer received | 0 | 0.0 | 1 | 0 | 5 |
| 4 | 8 | offer received | 0 | 0.0 | 6 | 0 | 6 |
print(offers.info())
offers.head()
<class 'pandas.core.frame.DataFrame'> Int64Index: 148431 entries, 0 to 148430 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 148431 non-null int64 1 event 148431 non-null object 2 time 148431 non-null int64 3 offer_id 148431 non-null int32 4 user_id 148431 non-null int64 5 amount 148431 non-null float64 6 reward 148431 non-null int64 7 difficulty 148431 non-null int64 8 duration 148431 non-null int64 9 offer_type 148431 non-null object 10 mobile 148431 non-null int64 11 social 148431 non-null int64 12 web 148431 non-null int64 13 expire_time 148431 non-null int64 14 off_id_use_id 148431 non-null object 15 viewed 148431 non-null int64 16 completed 148431 non-null int64 17 group 148431 non-null int64 18 gender 148431 non-null object 19 age 148431 non-null int64 20 income 148431 non-null int32 21 user_date 148431 non-null datetime64[ns] 22 signup_year 148431 non-null int64 23 age_group 148431 non-null int32 24 income_group 148431 non-null int32 25 spend_group 148431 non-null int32 dtypes: datetime64[ns](1), float64(1), int32(5), int64(15), object(4) memory usage: 27.7+ MB None
| index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | ... | completed | group | gender | age | income | user_date | signup_year | age_group | income_group | spend_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | ... | 1 | 4 | F | 75 | 100000 | 2017-05-09 | 2017 | 5 | 5 | 8 |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | ... | 0 | 2 | M | 68 | 70000 | 2018-04-26 | 2018 | 4 | 3 | 4 |
| 2 | 5 | offer received | 0 | 8 | 4 | 0.0 | 5 | 5 | 5 | bogo | ... | 1 | 4 | M | 65 | 53000 | 2018-02-09 | 2018 | 4 | 2 | 3 |
| 3 | 7 | offer received | 0 | 1 | 5 | 0.0 | 0 | 0 | 4 | informational | ... | 0 | 1 | M | 58 | 51000 | 2017-11-11 | 2017 | 3 | 2 | 2 |
| 4 | 8 | offer received | 0 | 6 | 6 | 0.0 | 5 | 20 | 10 | discount | ... | 0 | 2 | F | 61 | 57000 | 2017-09-11 | 2017 | 4 | 2 | 5 |
5 rows × 26 columns
This takes into account RFM (recency, frequency, monetary) segmentation, but since there is only 1 month of data, recency won't be of any use. Instead of recency, I will be looking at tenure. This section calculates the frequency, monetary value, and tenure for each customer.
# Extract transactions
transaction = transcript.query('event == "transaction"').drop('event', axis=1)
# Merge with user data
transaction = pd.merge(transaction, profile.drop(['amount', 'spend_group'], axis=1), on='user_id', how='left')
transaction.head()
| index | time | amount | offer_id | reward | user_id | gender | age | income | user_date | signup_year | age_group | income_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12654 | 0 | 0.83 | 0 | 0 | 29 | F | 20 | 30000 | 2016-07-11 | 2016 | 1 | 1 |
| 1 | 12657 | 0 | 34.56 | 0 | 0 | 40 | M | 42 | 96000 | 2016-01-17 | 2016 | 2 | 5 |
| 2 | 12659 | 0 | 13.23 | 0 | 0 | 46 | M | 36 | 56000 | 2017-12-28 | 2017 | 1 | 2 |
| 3 | 12670 | 0 | 19.51 | 0 | 0 | 108 | F | 55 | 94000 | 2017-10-16 | 2017 | 3 | 5 |
| 4 | 12671 | 0 | 18.97 | 0 | 0 | 124 | F | 39 | 67000 | 2017-12-17 | 2017 | 1 | 3 |
# Create a binary feature indicating whether the transaction amount was at least $1
transaction['gte1'] = transaction.amount.apply(lambda a: int(a >= 1))
# Calculate the frequency, monetary value for all users
fmt = transaction.groupby('user_id', as_index=False).agg({
'gte1': 'sum', # frequency
'amount': 'sum', # monetary
})
fmt.columns = ['user_id', 'frequency', 'monetary']
# Add users with no transactions
fmt = pd.merge(profile[['user_id', 'user_date']], fmt, on='user_id', how='left')
fmt = fmt.set_index('user_id').fillna(0)
fmt['frequency'] = fmt.frequency.astype(int)
fmt.shape
(14825, 3)
# Calculate a reference date for tenure - 30 days after the latest signup date
ref_date = profile.user_date.max() + pd.Timedelta(30, 'days')
# Calculate the tenure for all users
fmt['tenure'] = fmt.user_date.apply(lambda d: (ref_date - d).days // 30)
fmt.drop('user_date', axis=1, inplace=True)
fmt.head()
| frequency | monetary | tenure | |
|---|---|---|---|
| user_id | |||
| 1 | 3 | 77.01 | 13 |
| 2 | 7 | 159.27 | 15 |
| 3 | 3 | 57.73 | 4 |
| 4 | 3 | 36.43 | 6 |
| 5 | 3 | 15.62 | 9 |
The above-mentioned FMT values are calculated for each user as:
- Frequency - number of transactions made, where the amount is at least $1
- Monetary value - total amount spent by each user
- Tenure - number of months from signup date (user_date) to the reference date
# Reduce FMT values into quantiles
fmt['F'] = pd.qcut(fmt.frequency, 6, range(1, 7)).astype(int)
fmt['M'] = pd.qcut(fmt.monetary, 8, range(1, 9)).astype(int)
fmt['T'] = pd.qcut(fmt.tenure, 3, range(1, 4)).astype(int)
# Aggregate quantile values
fmt['FMT'] = fmt['F'].astype(str) + fmt['M'].astype(str) + fmt['T'].astype(str) # concat as str for segment name
fmt['score'] = fmt['F'] + fmt['M'] + fmt['T'] # sum as int for customer score
fmt.head()
| frequency | monetary | tenure | F | M | T | FMT | score | |
|---|---|---|---|---|---|---|---|---|
| user_id | ||||||||
| 1 | 3 | 77.01 | 13 | 1 | 4 | 2 | 142 | 7 |
| 2 | 7 | 159.27 | 15 | 3 | 6 | 2 | 362 | 11 |
| 3 | 3 | 57.73 | 4 | 1 | 4 | 1 | 141 | 6 |
| 4 | 3 | 36.43 | 6 | 1 | 3 | 1 | 131 | 5 |
| 5 | 3 | 15.62 | 9 | 1 | 2 | 1 | 121 | 4 |
def fmt_means(fmt_df, cols=['frequency', 'monetary', 'tenure'],
group_col='score', count_col='FMT'):
'''
Calculate the frequency, monetary, and tenure (FMT) means for each segment, as well as
the number and percentage of users in each segment.
Args:
(1) fmt_df (Pandas dataframe) - FMT data
(2) cols (list[str]) - names of columns to find the mean for
(3) group_col (str) - column to group by
(4) count_col (str) - column to count in the grouping
Returns:
Frequency, monetary, and tenure (FMT) means and number and percentage of users
for each segment (Pandas dataframe)
'''
df = fmt_df.groupby(group_col).mean()[cols].round(2)
df['n_users'] = fmt_df.groupby(group_col)[count_col].count()
df['pct_users'] = (100 * df.n_users / fmt_df.shape[0]).round(2)
return df
# Calculate the FMT means and user counts for each score segment
fmt_means(fmt)
| frequency | monetary | tenure | n_users | pct_users | |
|---|---|---|---|---|---|
| score | |||||
| 3 | 1.62 | 6.35 | 5.01 | 951 | 6.41 |
| 4 | 2.66 | 12.58 | 7.47 | 943 | 6.36 |
| 5 | 3.53 | 26.82 | 7.08 | 1052 | 7.10 |
| 6 | 4.34 | 37.65 | 8.59 | 1047 | 7.06 |
| 7 | 4.79 | 51.64 | 12.12 | 840 | 5.67 |
| 8 | 5.30 | 69.95 | 14.52 | 907 | 6.12 |
| 9 | 6.22 | 78.77 | 19.03 | 921 | 6.21 |
| 10 | 7.40 | 92.34 | 21.26 | 1026 | 6.92 |
| 11 | 8.90 | 107.15 | 24.83 | 1300 | 8.77 |
| 12 | 9.88 | 131.55 | 23.22 | 1218 | 8.22 |
| 13 | 11.51 | 151.20 | 25.74 | 1350 | 9.11 |
| 14 | 11.16 | 213.38 | 23.00 | 1076 | 7.26 |
| 15 | 12.48 | 264.39 | 23.10 | 1021 | 6.89 |
| 16 | 14.43 | 311.64 | 25.20 | 844 | 5.69 |
| 17 | 16.93 | 331.60 | 30.70 | 329 | 2.22 |
# Create 3 user tiers based on score: 3 - 7, 8 - 12, 13 - 17
fmt['tier'] = (fmt.score + 2) // 5
fmt['user_tier'] = fmt.tier.map({1: 'Red', 2: 'Yellow', 3: 'Blue'})
# Calculate the FMT means and user counts for each tier segment
fmt_means(fmt, group_col='tier')
| frequency | monetary | tenure | n_users | pct_users | |
|---|---|---|---|---|---|
| tier | |||||
| 1 | 3.38 | 26.67 | 7.95 | 4833 | 32.60 |
| 2 | 7.77 | 98.71 | 21.05 | 5372 | 36.24 |
| 3 | 12.56 | 232.85 | 24.77 | 4620 | 31.16 |
def snake_plot(segment_df, feats=['frequency', 'monetary', 'tenure'],
normalize=True, segment_col='tier', user_col='user_id',
title='Snake Plot', legend_title='User Tier', palette=None):
'''
Create a snake plot for the specified features of customer segments.
Args:
(1) segment_df (Pandas dataframe) - user data with defined segments
(2) feats (list[str]) - names of columns to plot
(3) normalize (bool) - whether to normalize the data before plotting
(4) segment_col (str) - name of segment column
(5) user_col (str) - name of user ID column
(6) title (str) - plot name
(7) legend_title (str) - legend name
(8) palette (list[str] or str) - named color palette or list of colors for bars
Returns: None.
'''
# Set index
if user_col in segment_df.columns:
segment_df.set_index(user_col, inplace=True)
# Normalize values
if normalize:
df = pd.DataFrame(StandardScaler().fit_transform(segment_df[feats]), index=segment_df.index, columns=feats)
df[segment_col] = segment_df[segment_col]
else:
df = segment_df[feats]
# Melt features
melt = pd.melt(df.reset_index(),
id_vars=[user_col, segment_col], value_vars=feats,
var_name='Feature', value_name='Value')
# Create a snake plot for customer segments
title += ' for ' + legend_title + 's'
sns.lineplot(data=melt, x='Feature', y='Value', hue=segment_col, palette=palette)
plt.title(title, fontsize='x-large')
plt.legend(title=legend_title, bbox_to_anchor=(1, 1))
plt.show()
# Snake plot
snake_plot(fmt, palette=['red', 'yellow', 'blue'])
The FMT values are reduced to quantiles in order to score each user. The number of quantiles chosen for each feature acts as the weight on which the segmentation is based:
- Frequency - 6 quantiles (1 - 6 points)
- Monetary value - 8 quantiles (1 - 8 points)
- Tenure - 3 quantiles (1 - 3 points)

Example: a user in the top quantile for all 3 features will get a total score of 17 (6 + 8 + 3).
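The quantile scoring can be illustrated on toy data (the values below are made up; the notebook uses 6/8/3 bins on the real columns):

```python
import pandas as pd

# Toy monetary values for 8 users (hypothetical amounts)
monetary = pd.Series([5, 12, 30, 55, 80, 110, 200, 310])

# Cut into 4 equal-count quantile bins labelled 1-4 (the notebook uses 8 bins for monetary)
m_score = pd.qcut(monetary, 4, labels=range(1, 5)).astype(int)
print(m_score.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
```

Because `qcut` produces equal-count bins, every score level holds roughly the same number of users, which keeps the scores comparable across features.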
The weights are chosen arbitrarily, but the general idea is that monetary value is the most important feature here since it is the main focus of this project. It is followed by frequency and then tenure, which would both have more weight if we are focusing on how customers are using the app, but this is not the case.
The individual scores for each feature are summed for a maximum of 17 points, which creates 15 cohorts of users (3 - 17 points). Since there are too many cohorts and the cohort sizes aren't very balanced, I grouped the cohorts into 3 tiers of customers: red customers have a total score between 3 and 7, yellow between 8 and 12, and blue between 13 and 17.
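The tier boundaries can be verified directly against the `(score + 2) // 5` expression used earlier:

```python
# Check that (score + 2) // 5 maps scores 3-17 onto tiers 1-3
tiers = {s: (s + 2) // 5 for s in range(3, 18)}
assert all(tiers[s] == 1 for s in range(3, 8))    # red: scores 3-7
assert all(tiers[s] == 2 for s in range(8, 13))   # yellow: scores 8-12
assert all(tiers[s] == 3 for s in range(13, 18))  # blue: scores 13-17
print(sorted(set(tiers.values())))  # [1, 2, 3]
```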
As seen in the snake plot above, the average of all 3 FMT metrics increases with an increasing user tier, although not at the same rate.
# Merge FMT data into offer events
offers = pd.merge(offers, fmt.reset_index()[['user_id', 'frequency', 'monetary', 'tenure', 'user_tier']], on='user_id', how='left')
# Extract true offers
true_offers = offers.query('event == "offer received" and group != 3')
print(true_offers.shape)
true_offers.head()
(55004, 30)
| index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | ... | income | user_date | signup_year | age_group | income_group | spend_group | frequency | monetary | tenure | user_tier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | ... | 100000 | 2017-05-09 | 2017 | 5 | 5 | 8 | 7 | 159.27 | 15 | Yellow |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | ... | 70000 | 2018-04-26 | 2018 | 4 | 3 | 4 | 3 | 57.73 | 4 | Red |
| 2 | 5 | offer received | 0 | 8 | 4 | 0.0 | 5 | 5 | 5 | bogo | ... | 53000 | 2018-02-09 | 2018 | 4 | 2 | 3 | 3 | 36.43 | 6 | Red |
| 3 | 7 | offer received | 0 | 1 | 5 | 0.0 | 0 | 0 | 4 | informational | ... | 51000 | 2017-11-11 | 2017 | 3 | 2 | 2 | 3 | 15.62 | 9 | Red |
| 4 | 8 | offer received | 0 | 6 | 6 | 0.0 | 5 | 20 | 10 | discount | ... | 57000 | 2017-09-11 | 2017 | 4 | 2 | 5 | 6 | 85.55 | 11 | Yellow |
5 rows × 30 columns
def plot_offer_completion(data, hue, hue_order, x='offer_id', y='completed', palette=None):
'''
Plot offer completion rate by customer segment.
Args:
(1) data (Pandas dataframe) - data to plot
(2) hue (str) - name of segment feature column
(3) hue_order (list) - order of hue values
(4) x (str) - name of offer ID column
(5) y (str) - name of completed indicator column
(6) palette (list[str] or str) - named color palette or list of colors for bars
Returns: None.
'''
title = hue.replace("_", " ").title()
plt.figure(figsize=(16, 4))
sns.barplot(data=data, x=x, y=y, hue=hue, hue_order=hue_order, palette=palette)
plt.title(f'Completion Rate by {title}', fontsize='xx-large')
plt.ylabel('Proportion of offers completed')
plt.xlabel(x.replace('_', ' ').title())
plt.legend(title=title, bbox_to_anchor=(1, 1))
plt.show()
plot_offer_completion(true_offers, 'user_tier', ['Red', 'Yellow', 'Blue'],
palette=['red' ,'yellow', 'blue'])
Blue customers complete offers at a noticeably higher rate than red customers for all offers.
# Merge user data into FMT data
fmt = pd.merge(fmt.reset_index(), profile, on='user_id', how='left')
fmt.head()
| user_id | frequency | monetary | tenure | F | M | T | FMT | score | tier | user_tier | gender | age | income | user_date | signup_year | age_group | income_group | amount | spend_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 77.01 | 13 | 1 | 4 | 2 | 142 | 7 | 1 | Red | F | 55 | 112000 | 2017-07-15 | 2017 | 3 | 5 | 77.01 | 5 |
| 1 | 2 | 7 | 159.27 | 15 | 3 | 6 | 2 | 362 | 11 | 2 | Yellow | F | 75 | 100000 | 2017-05-09 | 2017 | 5 | 5 | 159.27 | 8 |
| 2 | 3 | 3 | 57.73 | 4 | 1 | 4 | 1 | 141 | 6 | 1 | Red | M | 68 | 70000 | 2018-04-26 | 2018 | 4 | 3 | 57.73 | 4 |
| 3 | 4 | 3 | 36.43 | 6 | 1 | 3 | 1 | 131 | 5 | 1 | Red | M | 65 | 53000 | 2018-02-09 | 2018 | 4 | 2 | 36.43 | 3 |
| 4 | 5 | 3 | 15.62 | 9 | 1 | 2 | 1 | 121 | 4 | 1 | Red | M | 58 | 51000 | 2017-11-11 | 2017 | 3 | 2 | 15.62 | 2 |
def plot_user_demographics(user_df, title='User Demographics', height=250, width=1000,
gender_col='gender', age_col='age', income_col='income'):
'''
Plot the distribution of user demographics - `gender`, `age`, and `income`.
Args:
(1) user_df (Pandas dataframe) - user data
(2) title (str) - plot title
(3) height (int) - figure height
(4) width (int) - figure width
(5) gender_col (str) - name of gender column
(6) age_col (str) - name of age column
(7) income_col (str) - name of income column
Returns: None
'''
fig = subplots.make_subplots(1, 3, horizontal_spacing=0.05, y_title='Number of users',
subplot_titles=['Gender', 'Age', 'Income'])
fig.add_trace(go.Histogram(x=user_df[gender_col].sort_values(), name='Gender'), 1, 1) # gender
fig.add_trace(go.Histogram(x=user_df[age_col], name='Age'), 1, 2) # age
fig.add_trace(go.Histogram(x=user_df[income_col], name='Income'), 1, 3) # income
fig.update_layout(title=title, height=height, width=width,
margin=dict(t=80, r=0, b=0, l=80), showlegend=False)
fig.show()
# Plot user demographics for each customer tier
tier_demo = []
for t in ['Red', 'Yellow', 'Blue']:
fmt_tier = fmt[fmt.user_tier == t]
tier_demo.append([fmt_tier.shape[0], int(fmt_tier.age.mean()), fmt_tier.income.mean()])
plot_user_demographics(fmt_tier, title=f'{t} Customer Demographics')
pd.DataFrame(tier_demo, index=range(1, 4), columns=['n_users', 'avg_age', 'avg_income'])
| n_users | avg_age | avg_income | |
|---|---|---|---|
| 1 | 4833 | 52 | 60382.785020 |
| 2 | 5372 | 55 | 66598.845867 |
| 3 | 4620 | 54 | 69270.562771 |
del ref_date, true_offers, tier_demo, fmt_tier, t
gc.collect()
12403
With an increasing customer tier, we can see the following changes:
Gender - proportion of female users increases relative to male usersred group, there are more than twice as many male users as female usersblue group, the number of female and male users are almost equalAge - number of users under 40 decreasesIncome - number of lower earners decreases and number of higher earners increaseblue users is about 9k higher than the average income of red usersAs red users do not spend a lot of money, they're not very likely to respond to offers so it would be a good idea to either stop sending them offers or to only send them offers that are easy to complete.
Yellow users spend quite a bit more than red users, so it's worth sending them offers. They completed a good portion of discount offers 3 and 4 that were sent out, so focusing on the easier offers or lowering the difficulty of the harder offers would likely increase the number of offers completed.
Blue users consistently have a high rate of offer completion so it would actually benefit Starbucks to increase the difficulty of offers being sent to these users. As they are highly likely to respond to offers, a higher difficulty would likely increase the amount these customers spend.
# Group "female" and "other" gender into a new binary feature
fmt['male'] = (fmt.gender == 'M').astype(int)
fmt.head()
| user_id | frequency | monetary | tenure | F | M | T | FMT | score | tier | ... | gender | age | income | user_date | signup_year | age_group | income_group | amount | spend_group | male | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 77.01 | 13 | 1 | 4 | 2 | 142 | 7 | 1 | ... | F | 55 | 112000 | 2017-07-15 | 2017 | 3 | 5 | 77.01 | 5 | 0 |
| 1 | 2 | 7 | 159.27 | 15 | 3 | 6 | 2 | 362 | 11 | 2 | ... | F | 75 | 100000 | 2017-05-09 | 2017 | 5 | 5 | 159.27 | 8 | 0 |
| 2 | 3 | 3 | 57.73 | 4 | 1 | 4 | 1 | 141 | 6 | 1 | ... | M | 68 | 70000 | 2018-04-26 | 2018 | 4 | 3 | 57.73 | 4 | 1 |
| 3 | 4 | 3 | 36.43 | 6 | 1 | 3 | 1 | 131 | 5 | 1 | ... | M | 65 | 53000 | 2018-02-09 | 2018 | 4 | 2 | 36.43 | 3 | 1 |
| 4 | 5 | 3 | 15.62 | 9 | 1 | 2 | 1 | 121 | 4 | 1 | ... | M | 58 | 51000 | 2017-11-11 | 2017 | 3 | 2 | 15.62 | 2 | 1 |
5 rows × 21 columns
# Normalize features
scaler = StandardScaler()
scaled = scaler.fit_transform(fmt[['male', 'age', 'income', 'frequency', 'monetary', 'tenure']])
# PCA
pca = PCA(n_components=5, random_state=0)
user_pca = pd.DataFrame(pca.fit_transform(scaled), columns=['comp' + str(i) for i in range(1, 6)])
user_pca.plot(kind='hist', bins=20, figsize=(18, 8), subplots=True, layout=(2, 3),
title='Distribution of PCA Components')
# Add components to FMT data
fmt = pd.concat([fmt, user_pca], axis=1)
pca.explained_variance_ratio_
array([0.28691816, 0.26401487, 0.14821769, 0.13255764, 0.11507985])
def heatmap(data, title, ylabel, xlabel='Feature', vmin=-1, vmax=1, cmap='RdBu'):
'''
Plot a heatmap.
Args:
(1) data (Pandas dataframe) - data to plot
(2) title (str) - plot name
(3) xlabel (str) - x-axis name
(4) ylabel (str) - y-axis name
(5) vmin (int or float) - colorbar minimum
(6) vmax (int or float) - colorbar maximum
(7) cmap (str) - named color palette
Returns: None.
'''
plt.figure(figsize=(8, 6))
sns.heatmap(data, annot=True, vmin=vmin, vmax=vmax, cmap=cmap, fmt='.3f')
plt.title(title, fontsize='x-large')
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.yticks(rotation=0)
plt.show()
# PCA component loadings
user_comps = pd.DataFrame(pca.components_, index=user_pca.columns,
columns=['male', 'age', 'income', 'frequency', 'monetary', 'tenure'])
heatmap(user_comps, 'PCA Component Loadings', 'Component')
Gender is reduced to a binary feature that indicates whether the user is male: there is a very low number of "other" users, so they are grouped with "female" users to represent non-male users. Since this is still a categorical feature and K-means clustering isn't meant for categorical features, PCA is applied first.

I will try to characterize each component based on its loadings:
- Component 1 - high spending amount
- Component 2 - low income, high spending frequency
- Component 3 - female, young, newer to the app
- Component 4 - male, high spending amount, newer to the app
- Component 5 - older, lower income, newer to the app

# Test different numbers of clusters
msd = [] # root mean squared distances
for k in range(3, 12):
kmeans = KMeans(k, random_state=0)
kmeans.fit(user_pca)
msd.append(np.sqrt(kmeans.inertia_ / user_pca.shape[0]))
# Elbow method
plt.figure()
sns.lineplot(x=range(3, 12), y=msd)
plt.title('Elbow method for K-means Clustering')
plt.ylabel('Root mean squared distance')
plt.xlabel('Number of clusters')
plt.xticks(range(3, 12), range(3, 12));
# K-means clustering
kmeans = KMeans(4, random_state=0)
kmeans.fit(user_pca)
fmt['cluster'] = kmeans.labels_
# Cluster summary
feats = ['male', 'age', 'income', 'frequency', 'monetary', 'tenure', 'comp1', 'comp2', 'comp3', 'comp4', 'comp5']
cluster_mean = fmt_means(fmt, feats, 'cluster')
cluster_mean
| male | age | income | frequency | monetary | tenure | comp1 | comp2 | comp3 | comp4 | comp5 | n_users | pct_users | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cluster | |||||||||||||
| 0 | 0.00 | 58.43 | 70343.42 | 5.47 | 85.41 | 12.65 | -0.12 | -1.00 | 0.71 | -0.60 | -0.02 | 4030 | 27.18 |
| 1 | 0.75 | 46.72 | 51993.22 | 12.79 | 116.70 | 32.34 | 0.54 | 1.52 | -0.26 | -0.38 | -0.11 | 3538 | 23.87 |
| 2 | 0.45 | 63.56 | 87546.26 | 10.25 | 292.02 | 23.20 | 1.73 | -0.54 | -0.14 | 0.54 | -0.03 | 2583 | 17.42 |
| 3 | 1.00 | 51.65 | 59063.12 | 4.77 | 47.83 | 8.69 | -1.26 | 0.02 | -0.34 | 0.51 | 0.12 | 4674 | 31.53 |
# Visualize clusters
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
sns.scatterplot(data=fmt, x='comp1', y='comp2', hue='cluster', ax=ax[0], alpha=0.8) # first 2 components
sns.scatterplot(data=fmt, x='age', y='income', hue='cluster', ax=ax[1], alpha=0.8);
fig.suptitle('K-means Visualizations - 4 clusters');
# Feature importance of each cluster relative to the overall mean
feats = ['male', 'age', 'income', 'frequency', 'monetary', 'tenure']
overall_mean = fmt[feats].mean()
feat_importances = cluster_mean.iloc[:, :6] / overall_mean - 1
heatmap(feat_importances, 'Cluster Feature Importances', 'Cluster')
# Snake plot
snake_plot(fmt, feats, segment_col='cluster', legend_title='Cluster')
fmt['customer_cluster'] = fmt.cluster.map({0: 'Female low spenders',
1: 'Low earners',
2: 'High earners',
3: 'Male low spenders'})
fmt.head()
| frequency | monetary | tenure | F | M | T | FMT | score | tier | user_tier | ... | amount | spend_group | male | comp1 | comp2 | comp3 | comp4 | comp5 | cluster | customer_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| 1 | 3 | 77.01 | 13 | 1 | 4 | 2 | 142 | 7 | 1 | Red | ... | 77.01 | 5 | 0 | 0.190966 | -2.206539 | 0.450535 | 0.000317 | -1.337460 | 0 | Female low spenders |
| 2 | 7 | 159.27 | 15 | 3 | 6 | 2 | 362 | 11 | 2 | Yellow | ... | 159.27 | 8 | 0 | 1.096733 | -2.001638 | 0.074636 | -0.226534 | -0.042568 | 0 | Female low spenders |
| 3 | 3 | 57.73 | 4 | 1 | 4 | 1 | 141 | 6 | 1 | Red | ... | 57.73 | 4 | 1 | -1.175927 | -1.002610 | -0.778918 | 0.612330 | 0.492451 | 3 | Male low spenders |
| 4 | 3 | 36.43 | 6 | 1 | 3 | 1 | 131 | 5 | 1 | Red | ... | 36.43 | 3 | 1 | -1.506981 | -0.443603 | -0.664860 | 0.264465 | 0.705047 | 3 | Male low spenders |
| 5 | 3 | 15.62 | 9 | 1 | 2 | 1 | 121 | 4 | 1 | Red | ... | 15.62 | 2 | 1 | -1.628904 | -0.129931 | -0.565190 | 0.157023 | 0.377109 | 3 | Male low spenders |
5 rows × 27 columns
Using the elbow method, I initially decided to create 7 clusters. But with 7 clusters, the cluster sizes were imbalanced. After some trial and error, I found 4 clusters to be optimal because the clusters are distinguishable and more interpretable.
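Besides eyeballing the elbow, the silhouette score offers a complementary check on the number of clusters (it was not part of the original analysis). A minimal sketch on synthetic blob data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Four well-separated 2-D blobs as a stand-in for the PCA-transformed users
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (4, 0), (0, 4), (4, 4)]])

scores = {}
for k in range(3, 8):
    labels = KMeans(k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better

best_k = max(scores, key=scores.get)
print(best_k)  # 4 on this toy data
```

On the real `user_pca` data the silhouette curve would likely be much flatter, so interpretability remains a reasonable tie-breaker, as done above.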
The main identifying characteristics for each cluster are:
- Cluster 0 - gender (-)
- Cluster 1 - frequency (+) and tenure (+)
- Cluster 2 - monetary value (+)
- Cluster 3 - gender (+), monetary value (-), and tenure (-)

The clusters were named using the snake plot. The general description of each cluster is as follows:
- Cluster 0 - Low spenders - Female
- Cluster 1 - Low earners
- Cluster 2 - High earners
- Cluster 3 - Low spenders - Male

def plot_user_demographic1(segment_df, segment, color,
cols=['gender'] + feats[1:],
segment_col='customer_cluster'):
'''
Plot the distribution of user demographics and FMT features in a segment.
Args:
(1) segment_df (Pandas dataframe) - data with defined segments
(2) segment (str or int) - name of segment
(3) color (tuple(float, float, float) or str) - RGB tuple or named color for bars
(4) cols (list[str]) - names of cols to plot
(5) segment_col (str) - name of segment column
Returns: None.
'''
# Extract cluster
cluster = segment_df.loc[segment_df[segment_col] == segment, cols]
# Histogram subplots
fig, ax = plt.subplots(2, 3, figsize=(16, 6))
fig.suptitle('Demographics of Cluster: ' + segment.title(), fontsize='x-large')
for i in range(3):
sns.histplot(cluster[cols[i]], bins=20, color=color, ax=ax[0, i])
sns.histplot(cluster[cols[i + 3]], bins=20, color=color, ax=ax[1, i])
plt.tight_layout()
# Demographics of male low spenders
plot_user_demographic1(fmt, 'Male low spenders', sns.color_palette()[0])
# Demographics of female low spenders
plot_user_demographic1(fmt, 'Female low spenders', sns.color_palette()[1])
# Demographics of low earners
plot_user_demographic1(fmt, 'Low earners', sns.color_palette()[2])
# Demographics of high earners
plot_user_demographic1(fmt, 'High earners', sns.color_palette()[3])
The plots above show the demographics of each cluster: male low spenders, female low spenders, low earners, and high earners. High earners transact less frequently than low earners, but spend the most money relative to the other groups.
fmt.head()
| frequency | monetary | tenure | F | M | T | FMT | score | tier | user_tier | ... | amount | spend_group | male | comp1 | comp2 | comp3 | comp4 | comp5 | cluster | customer_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||||||||
| 1 | 3 | 77.01 | 13 | 1 | 4 | 2 | 142 | 7 | 1 | Red | ... | 77.01 | 5 | 0 | 0.190966 | -2.206539 | 0.450535 | 0.000317 | -1.337460 | 0 | Female low spenders |
| 2 | 7 | 159.27 | 15 | 3 | 6 | 2 | 362 | 11 | 2 | Yellow | ... | 159.27 | 8 | 0 | 1.096733 | -2.001638 | 0.074636 | -0.226534 | -0.042568 | 0 | Female low spenders |
| 3 | 3 | 57.73 | 4 | 1 | 4 | 1 | 141 | 6 | 1 | Red | ... | 57.73 | 4 | 1 | -1.175927 | -1.002610 | -0.778918 | 0.612330 | 0.492451 | 3 | Male low spenders |
| 4 | 3 | 36.43 | 6 | 1 | 3 | 1 | 131 | 5 | 1 | Red | ... | 36.43 | 3 | 1 | -1.506981 | -0.443603 | -0.664860 | 0.264465 | 0.705047 | 3 | Male low spenders |
| 5 | 3 | 15.62 | 9 | 1 | 2 | 1 | 121 | 4 | 1 | Red | ... | 15.62 | 2 | 1 | -1.628904 | -0.129931 | -0.565190 | 0.157023 | 0.377109 | 3 | Male low spenders |
5 rows × 27 columns
offers.head()
| index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | ... | income | user_date | signup_year | age_group | income_group | spend_group | frequency | monetary | tenure | user_tier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | ... | 100000 | 2017-05-09 | 2017 | 5 | 5 | 8 | 7 | 159.27 | 15 | Yellow |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | ... | 70000 | 2018-04-26 | 2018 | 4 | 3 | 4 | 3 | 57.73 | 4 | Red |
| 2 | 5 | offer received | 0 | 8 | 4 | 0.0 | 5 | 5 | 5 | bogo | ... | 53000 | 2018-02-09 | 2018 | 4 | 2 | 3 | 3 | 36.43 | 6 | Red |
| 3 | 7 | offer received | 0 | 1 | 5 | 0.0 | 0 | 0 | 4 | informational | ... | 51000 | 2017-11-11 | 2017 | 3 | 2 | 2 | 3 | 15.62 | 9 | Red |
| 4 | 8 | offer received | 0 | 6 | 6 | 0.0 | 5 | 20 | 10 | discount | ... | 57000 | 2017-09-11 | 2017 | 4 | 2 | 5 | 6 | 85.55 | 11 | Yellow |
5 rows × 30 columns
# Merge cluster data into offer events
offers = pd.merge(offers, fmt.reset_index()[['user_id', 'cluster', 'customer_cluster']], on='user_id', how='left')
# Extract true offers
true_offers = offers.query('event == "offer received" and group != 3')
print(true_offers.shape)
true_offers.head(2)
(55004, 32)
| index | event | time | offer_id | user_id | amount | reward | difficulty | duration | offer_type | ... | signup_year | age_group | income_group | spend_group | frequency | monetary | tenure | user_tier | cluster | customer_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | offer received | 0 | 7 | 2 | 0.0 | 5 | 5 | 7 | bogo | ... | 2017 | 5 | 5 | 8 | 7 | 159.27 | 15 | Yellow | 0 | Female low spenders |
| 1 | 2 | offer received | 0 | 5 | 3 | 0.0 | 2 | 10 | 7 | discount | ... | 2018 | 4 | 3 | 4 | 3 | 57.73 | 4 | Red | 3 | Male low spenders |
2 rows × 32 columns
# Plot completion rate by cluster
plot_offer_completion(true_offers, hue='customer_cluster',
                      hue_order=true_offers.groupby('customer_cluster').monetary.mean().sort_values().index)
del scaler, scaled, pca, user_pca, user_comps, \
    overall_mean, cluster_mean, feat_importances, \
    true_offers, feats, fig, ax, msd, k
gc.collect()
37495
In the barplot above, the customer clusters are ordered by their average monetary value, i.e. male low spenders spent the least and high earners spent the most. For most offers, as a group's average spending increases, so does the rate at which it completes offers.
The completion rates of male low spenders, low earners, and high earners look similar to those of the red, yellow, and blue tiers from the FMT segmentation, respectively, so the same suggestions would apply to these clusters.
profile.to_pickle('../data/output2/profile.pkl') # user data
transcript.reset_index().to_pickle('../data/output2/transcript.pkl') # event data
offers.to_pickle('../data/output2/offers.pkl') # transcript of offer-related events
fmt.to_pickle('../data/output2/fmt.pkl') # FMT data
joblib.dump(kmeans, '../data/output2/kmeans4.pkl')
profile.shape, transcript.shape, offers.shape, fmt.shape
((14825, 10), (272388, 7), (148431, 32), (14825, 27))